Member

@rbeucher rbeucher commented Dec 15, 2025

Overview

This PR adds support for using xarray Dataset and DataArray objects as direct inputs to ACCESS_ESM_CMORiser, enabling in-memory processing workflows while maintaining full backward compatibility with existing file-based workflows.

Problem Statement

The current CMORiser assumes a list of files as input:

cmoriser = ACCESS_ESM_CMORiser(
    input_paths=files,  # Must be file paths
    compound_name="Amon.rsds", 
    # ... other parameters
)

This creates limitations for workflows where:

  • Data is already loaded as xarray objects
  • Users want to process derived/computed variables
  • Integration with xarray-based analysis pipelines is needed
  • Temporary file I/O should be avoided for performance

Solution

New input_data Parameter

Added a new input_data parameter that accepts:

  • File paths (strings or lists) - same as before
  • xarray Dataset - used directly for CMORisation
  • xarray DataArray - automatically converted to Dataset

Usage Examples

With xarray Dataset

import xarray as xr
from access_moppy import ACCESS_ESM_CMORiser

# Load data as xarray Dataset
ds = xr.open_dataset("model_output.nc")

# Use directly with CMORiser  
cmoriser = ACCESS_ESM_CMORiser(
    input_data=ds,  # New parameter
    compound_name="Amon.rsds",
    experiment_id="historical",
    source_id="ACCESS-ESM1-5",
    variant_label="r1i1p1f1",
    grid_label="gn",
    activity_id="CMIP"
)

cmoriser.run()
cmoriser.write()

With xarray DataArray

# Extract specific variable as DataArray
temperature = ds.temp  

# Still works - automatically converted to Dataset
cmoriser = ACCESS_ESM_CMORiser(
    input_data=temperature,
    compound_name="Amon.tas",
    # ... other parameters
)

Backward Compatibility

Existing code continues to run without modification; the old input_paths parameter is still accepted but now emits a deprecation warning:

# This still works exactly as before
cmoriser = ACCESS_ESM_CMORiser(
    input_paths=files,  # Old parameter - shows deprecation warning
    compound_name="Amon.rsds",
    # ... other parameters  
)
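
As a rough illustration of how the two parameters could be reconciled, here is a minimal sketch. The helper name `resolve_input` is hypothetical, and rejecting calls that pass both parameters is an assumption rather than something this PR description confirms.

```python
import warnings


def resolve_input(input_data=None, input_paths=None):
    """Return the single input source, preferring the new ``input_data`` name.

    Hypothetical helper: the real constructor may validate differently.
    """
    if input_data is not None and input_paths is not None:
        # Assumption: passing both is treated as an error.
        raise ValueError("Pass either input_data or input_paths, not both.")
    if input_paths is not None:
        # The old parameter keeps working but tells the caller to migrate.
        warnings.warn(
            "input_paths is deprecated; use input_data instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return input_paths
    if input_data is None:
        raise ValueError("One of input_data or input_paths is required.")
    return input_data
```

With this shape, `resolve_input(input_paths=files)` returns `files` and emits a `DeprecationWarning`, while `resolve_input(input_data=ds)` returns `ds` silently.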

- Add new `input_data` parameter to accept xarray Dataset or DataArray objects
- Maintain full backward compatibility with existing `input_paths` parameter
- Automatically convert DataArrays to Datasets for processing
- Skip frequency validation for xarray inputs (data already loaded)
- Update all CMORiser subclasses (Atmosphere, Ocean OM2/OM3) to support new interface
- Preserve all existing functionality (resampling, chunking, validation)
- Add comprehensive parameter validation and deprecation warnings

This enables in-memory processing workflows and integration with xarray-based analysis pipelines while maintaining compatibility with existing file-based workflows.

All existing tests pass (34/34) confirming no breaking changes.
@rbeucher rbeucher requested a review from rhaegar325 December 15, 2025 04:23

codecov bot commented Dec 15, 2025

Codecov Report

❌ Patch coverage is 13.35312% with 292 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.35%. Comparing base (5c87de5) to head (06845ea).
⚠️ Report is 4 commits behind head on main.

Files with missing lines          Patch %   Lines
src/access_moppy/utilities.py     4.93%     212 Missing ⚠️
src/access_moppy/base.py          27.63%    55 Missing ⚠️
src/access_moppy/atmosphere.py    22.22%    14 Missing ⚠️
src/access_moppy/driver.py        42.10%    11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #145      +/-   ##
==========================================
- Coverage   57.55%   52.35%   -5.21%     
==========================================
  Files          18       18              
  Lines        2403     2722     +319     
==========================================
+ Hits         1383     1425      +42     
- Misses       1020     1297     +277     


Collaborator

@rhaegar325 rhaegar325 left a comment

Thanks @rbeucher for this PR, the overall structure is very clear and the implementation is well organised. The design aligns nicely with the existing CMORisation workflow, and after some basic testing I didn’t observe any obvious bugs.

There are a few edge cases around bounds handling that might be worth keeping in mind for future improvements:

For ocean variables, the current time_bnds handling is not yet fully covered. When input_data is provided as a DataArray, time_bnds can be missing, which may lead to runtime errors. In practice, this means the workflow relies on a helper function that automatically derives time_bnds from time.

A similar pattern appears for atmosphere variables with spatial bounds. For example, if bounds variables are not present, errors such as

KeyError: "No variable named 'lon_bnds'. Variables on the dataset include ['lon', 'time', 'lat', 'pr']"

can occur.

Additionally, if the dataset is loaded without decode_cf=False, time coordinates may be decoded as cftime objects, which can trigger a TypeError in atmosphere.py (around line 174) when they are implicitly cast to numeric types:

TypeError: float() argument must be a string or a real number, not 'cftime._cftime.DatetimeProlepticGregorian'.

These don’t look like blockers for this PR, but it might be useful to document these assumptions or add some lightweight safeguards around bounds generation and time handling in a follow-up.

)
if resampling_required:

# Keep only required data variables

Dropped redundant coordinates and dimensions to prevent them from affecting other parts of the workflow. These steps were previously handled implicitly by xr.open_mfdataset() and now need to be handled explicitly.

used_dims.update(self.ds[var].dims)

# Exclude auxiliary time dimension
if "time_0" in used_dims:

time_0 is a special coordinate and needs to be handled specifically.

self.ds = self.chunker.rechunk_dataset(self.ds)
print("✅ Dataset rechunking completed")

def _ensure_numeric_time_coordinates(self, ds: xr.Dataset) -> xr.Dataset:

Method that converts cftime objects to numeric values.

return ds_resampled, True


def calculate_time_bounds(

Three functions for calculating time_bnds, lat_bnds and lon_bnds.

) # Make a copy to avoid modifying original

# SAFEGUARD: Convert cftime coordinates to numeric if present
self.ds = self._ensure_numeric_time_coordinates(self.ds)

Added a safeguard to handle cases where the input data use cftime in time-related coordinates and variables.

Collaborator

rhaegar325 commented Jan 7, 2026

Add automatic calculation of missing coordinate bounds and fix cftime handling for CMIP6 compliance

This PR implements automatic calculation of missing coordinate bounds (time_bnds, lat_bnds, lon_bnds) and fixes cftime object handling to ensure CMIP6 compliance when processing raw model data.

Changes Made

1. cftime to numeric conversion safeguard

  • Added _ensure_numeric_time_coordinates() method to convert cftime objects to numeric values when datasets are loaded with decode_cf=True
  • Prevents TypeError: float() argument must be a string or a real number, not 'cftime._cftime.DatetimeProlepticGregorian' in downstream operations
  • Preserves time encoding attributes (units, calendar) after conversion
  • Uses default units='days since 0001-01-01' when missing
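
To make the conversion concrete, here is a minimal sketch of the idea using only the standard library. It is not the `_ensure_numeric_time_coordinates()` implementation: Python's `datetime` (which follows the proleptic Gregorian calendar) stands in for cftime objects, the function names are hypothetical, and the real method works on xarray datasets and existing encoding attributes.

```python
from datetime import datetime

# Default units quoted in the list above, used when none are recorded.
DEFAULT_UNITS = "days since 0001-01-01"


def to_numeric_days(value, reference=datetime(1, 1, 1)):
    """Return ``value`` as fractional days since ``reference``.

    Values that are already numeric pass through unchanged (as floats).
    """
    if isinstance(value, (int, float)):
        return float(value)
    return (value - reference).total_seconds() / 86400.0


def ensure_numeric_time(times):
    """Convert a sequence of time values and return them with their attributes."""
    values = [to_numeric_days(t) for t in times]
    # Assumption: stdlib datetimes imply a proleptic Gregorian calendar.
    attrs = {"units": DEFAULT_UNITS, "calendar": "proleptic_gregorian"}
    return values, attrs
```

Once the values are plain floats, a downstream `float(...)` cast no longer raises the `TypeError` quoted above.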

2. Coordinate cleanup improvements

  • Implemented proper removal of unused coordinates in load_dataset()
  • Added handling for auxiliary dimensions (e.g., time_0) via isel(time_0=0, drop=True)
  • Only keeps coordinates that are actually used as dimensions by data variables
  • Prevents dimension mismatch errors in transpose operations
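
The bookkeeping behind this cleanup can be sketched on plain dictionaries. This is only an illustration under stated assumptions: `variables` maps each variable name to its tuple of dimension names, the function names are hypothetical, and the actual code operates on an xarray Dataset (including the `isel(time_0=0, drop=True)` step, which is not modelled here).

```python
def used_dimensions(variables, required):
    """Collect the dimension names used by the required data variables."""
    used = set()
    for name in required:
        used.update(variables[name])
    return used


def prune_coordinates(variables, coords, required, auxiliary=("time_0",)):
    """Keep only coordinates that are dimensions of the required variables.

    Auxiliary dimensions such as ``time_0`` are always excluded.
    """
    used = used_dimensions(variables, required) - set(auxiliary)
    # Preserve the original coordinate order.
    return [c for c in coords if c in used]
```

For a dataset whose only required variable is `pr` with dimensions `("time", "lat", "lon")`, coordinates such as `height` or `time_0` would be dropped from the list.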

3. New utility functions in utilities.py

  • calculate_latitude_bounds(): Calculates latitude bounds for both regular and irregular grids, with proper handling of polar boundaries (clipping to [-90°, 90°])
  • calculate_longitude_bounds(): Calculates longitude bounds supporting both the 0° to 360° and -180° to 180° conventions, with automatic detection of global vs. regional grids and proper handling of periodic boundaries
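
One common way to derive such bounds is to place cell edges halfway between neighbouring cell centres. The sketch below illustrates that approach on plain Python lists; it is not the code in utilities.py. The function names are hypothetical, the global-grid test (outer edges spanning 360°) is an assumption, and no wrapping of periodic boundaries is attempted.

```python
def cell_edges(centers):
    """Edges halfway between neighbouring centres, extrapolated at both ends."""
    if len(centers) < 2:
        raise ValueError("Need at least two points to infer the spacing.")
    mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    first = centers[0] - (mids[0] - centers[0])
    last = centers[-1] + (centers[-1] - mids[-1])
    return [first] + mids + [last]


def latitude_bounds(lat):
    """Return (lower, upper) pairs per cell, clipped to [-90, 90]."""
    edges = [max(-90.0, min(90.0, e)) for e in cell_edges(lat)]
    return list(zip(edges, edges[1:]))


def longitude_bounds(lon):
    """Return (lower, upper) pairs per cell and whether the grid looks global."""
    edges = cell_edges(lon)
    # Assumption: a grid is global when its outer edges span 360 degrees.
    is_global = abs(abs(edges[-1] - edges[0]) - 360.0) < 1e-6
    return list(zip(edges, edges[1:])), is_global
```

Because edges are taken at midpoints, irregular spacing is handled cell by cell, and the clipping keeps polar cells from extending past ±90°.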

4. Enhanced calculate_time_bounds() function

  • Added time_coord parameter to support different time coordinate names
  • Added bnds_name parameter to support different bounds dimension names ("nv" for ocean, "bnds" for atmosphere)

5. Updated CMIP6_Atmosphere_CMORiser class

  • Added automatic detection of missing bounds variables during select_and_process_variables()
  • Automatically calculates missing bounds using the appropriate utility function based on coordinate type
  • Issues user warnings when bounds are missing from raw data and being auto-calculated
  • Maintains flexibility for different coordinate naming conventions (lat/latitude/y, lon/longitude/x, time/t)
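
The detection step could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `select_and_process_variables()`: the alias table, the function names and the `_bnds` suffix convention are hypothetical, and variable names are passed as a plain list instead of an xarray Dataset.

```python
import warnings

# Hypothetical alias table based on the naming conventions listed above.
COORD_ALIASES = {
    "latitude": ("lat", "latitude", "y"),
    "longitude": ("lon", "longitude", "x"),
    "time": ("time", "t"),
}


def coordinate_kind(name):
    """Map a coordinate name to 'latitude', 'longitude', 'time' or None."""
    for kind, aliases in COORD_ALIASES.items():
        if name in aliases:
            return kind
    return None


def find_missing_bounds(variable_names, coords, suffix="_bnds"):
    """Return {coord: kind} for recognised coords lacking a bounds variable."""
    missing = {}
    for coord in coords:
        kind = coordinate_kind(coord)
        if kind is not None and f"{coord}{suffix}" not in variable_names:
            warnings.warn(
                f"'{coord}{suffix}' not found; bounds will be calculated automatically.",
                UserWarning,
                stacklevel=2,
            )
            missing[coord] = kind
    return missing
```

The returned kind can then be used to pick the matching bounds function, which avoids the `KeyError` on `lon_bnds` reported in the earlier review.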

@rhaegar325 rhaegar325 self-requested a review January 8, 2026 22:59
Collaborator

@rhaegar325 rhaegar325 left a comment

These changes allow an xarray.DataArray to be used as an input format for Moppy.

@rbeucher rbeucher merged commit 0eb8d0e into main Jan 9, 2026
2 of 4 checks passed